01. Assessment

In Reinforcement Learning, what principle do Monte Carlo methods use to update the state value function V(s)?

SOLUTION: Play out an entire episode, then update V(s) for each state s encountered using the return observed from that state to the end of the episode.
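
Below is a minimal sketch of this idea as every-visit Monte Carlo prediction. The environment interface (`reset()`, `step(action)` returning `(next_state, reward, done)`) and the `policy` callable are hypothetical placeholders, not part of the original question.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=0.99):
    """Every-visit Monte Carlo prediction of V(s) under a fixed policy."""
    V = defaultdict(float)
    visit_count = defaultdict(int)

    for _ in range(num_episodes):
        # 1. Play out an entire episode before doing any updates.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # 2. Walk backwards, accumulating the return G from each state
        #    to the end of the episode, and update V(s) as a running average.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            visit_count[state] += 1
            V[state] += (G - V[state]) / visit_count[state]

    return V
```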

When training a Deep Q Network, how would you mitigate the adverse effects of correlation between consecutive experience tuples?

SOLUTION: Store the experience tuples in a replay buffer, then randomly sample a minibatch from it at every training iteration (experience replay).
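
A simple sketch of such a replay buffer is shown below; the class name, capacity, and batch size are illustrative choices rather than anything prescribed by the question.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of experience tuples. Uniform random sampling of
    minibatches breaks the temporal correlation between consecutive
    transitions collected from the environment."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling decorrelates the training batch.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

In use, each transition is added to the buffer as it is collected, and once the buffer holds at least `batch_size` transitions, a batch is sampled to compute the Q-learning targets for a gradient step.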

Policy Gradient methods in Deep Reinforcement Learning have an advantage over value-function approximation methods because:

SOLUTION: They can be used directly with continuous action spaces.
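
The sketch below illustrates why this works: a policy network can output the parameters of a continuous distribution (here a Gaussian) and sample an action directly, with no argmax over a discrete action set. The network architecture and the REINFORCE-style loss are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy for a continuous action space: outputs the mean and log-std
    of a Gaussian, so actions are sampled directly from the distribution."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

# REINFORCE-style update for one sampled step, where `state`, `action`,
# and the return `G` come from a collected episode (placeholders here):
#   dist = policy(state)
#   loss = -(dist.log_prob(action).sum() * G)
#   loss.backward()
```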